MotionGPT3: Human Motion as a Second Modality

MotionGPT3 is a hybrid motion-language model designed to process arbitrary input sequences of motion or language and to generate outputs in either modality.

Abstract

With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference.
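The quantization claim above can be illustrated with a toy sketch. It assumes a random codebook and random latent vectors (the sizes `dim` and `codebook_size` are hypothetical and not taken from the model) and compares nearest-codebook quantization against passing the continuous latent through unchanged. It does not model the VAE's own reconstruction error.

```python
import random

def squared_error(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_code(x, codebook):
    """Return the codebook entry closest to x, as a vector quantizer would."""
    return min(codebook, key=lambda code: squared_error(x, code))

rng = random.Random(0)
dim, codebook_size = 4, 8  # hypothetical sizes, chosen only for illustration

codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(100)]

# Discrete route: every latent is snapped to its nearest codebook entry.
quantized_error = sum(
    squared_error(z, nearest_code(z, codebook)) for z in latents
) / len(latents)

# Continuous route: the latent is used as-is, so no snapping error is added.
continuous_error = sum(squared_error(z, z) for z in latents) / len(latents)

print(f"mean quantization error: {quantized_error:.3f}")
print(f"mean continuous error:   {continuous_error:.3f}")
```

With a finite codebook the snapping error is strictly positive for almost every latent, which is the sense in which quantization bounds the achievable fidelity.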

Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizes optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference.
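As a rough illustration of a dual-stream layer with shared attention, the sketch below gives each modality its own query/key/value projections and then runs one attention pass over the concatenated text and motion tokens. This is a minimal single-head sketch under assumptions: the function names, the weight layout, and the absence of masking, normalization, and feed-forward blocks are all choices made for brevity, not details confirmed by the source.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def shared_attention(text_tokens, motion_tokens, text_w, motion_w):
    """Modality-specific projections, followed by joint attention over both streams."""
    # Each stream is projected by its own weights (the modality-specific route).
    q = matmul(text_tokens, text_w["q"]) + matmul(motion_tokens, motion_w["q"])
    k = matmul(text_tokens, text_w["k"]) + matmul(motion_tokens, motion_w["k"])
    v = matmul(text_tokens, text_w["v"]) + matmul(motion_tokens, motion_w["v"])

    # One attention pass over the concatenated sequence (the shared part).
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    out = matmul(weights, v)

    # Split the result back into the two streams.
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:], weights

rng = random.Random(0)
d = 4  # hypothetical hidden size
text_w = {name: rand_matrix(d, d, rng) for name in ("q", "k", "v")}
motion_w = {name: rand_matrix(d, d, rng) for name in ("q", "k", "v")}
text_tokens = rand_matrix(3, d, rng)    # 3 text tokens
motion_tokens = rand_matrix(2, d, rng)  # 2 motion latents

text_out, motion_out, weights = shared_attention(
    text_tokens, motion_tokens, text_w, motion_w
)
print(len(text_out), len(motion_out), len(weights[0]))
```

Because every token attends over the full concatenated sequence, information can flow in both directions, while the separate projection weights keep the two parameter sets distinct.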

Experiments show that MotionGPT3 achieves 2× faster convergence in training loss and up to 4× faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.

Our method

We introduce a three-stage alignment schedule for our hybrid motion-language model. First, the model learns to generate motion. Then, we further align the motion branch with language by introducing motion reasoning. Finally, we fine-tune the model through joint training with the text modules unfrozen.
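The three stages can be written down as a small schedule table. The sketch below is an assumption-laden reading of the description: the stage and task names are hypothetical labels, and keeping the text branch frozen during the first two stages is inferred from the text modules being described as unfrozen only in the final stage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    tasks: tuple               # hypothetical task labels
    train_motion_branch: bool
    train_text_branch: bool    # assumed False until the final stage

SCHEDULE = (
    # Stage 1: the model learns to generate motion.
    Stage("motion_generation", ("motion_generation",), True, False),
    # Stage 2: motion reasoning is introduced to align motion with language.
    Stage("motion_language_alignment", ("motion_generation", "motion_reasoning"), True, False),
    # Stage 3: joint fine-tuning with the text modules unfrozen.
    Stage("joint_finetuning", ("motion_generation", "motion_reasoning"), True, True),
)

def frozen_branches(stage):
    """List the branches whose parameters would be kept fixed in this stage."""
    frozen = []
    if not stage.train_motion_branch:
        frozen.append("motion")
    if not stage.train_text_branch:
        frozen.append("text")
    return frozen

for stage in SCHEDULE:
    print(stage.name, "| tasks:", stage.tasks, "| frozen:", frozen_branches(stage))
```

In a real training loop, the `frozen_branches` result would typically be used to disable gradient updates for the listed parameter groups before each stage begins.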

Experiments

  • 2× faster training convergence vs. discrete & unified baselines while maintaining or improving quality.
  • State-of-the-art R@1/R@3 on HumanML3D text-to-motion; lower MMDist.
  • Cross-modal attention (CMA), enabled in the last L layers, helps in a non-monotonic pattern: performance improves up to L=5 and degrades slightly at L=6.
  • Competitive results with a smaller motion branch (~51M params) and modest text backbones (124M).

Training Speed

Line charts comparing training speed and quality; our method converges about 2× faster with better R@k and lower MMDist.
Continuous motion + bimodal architecture accelerates training by ~2× vs. discrete/unified variants while improving quality.
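One simple way to quantify a convergence speed-up such as the one above is to compare how many iterations each run needs to reach the same target loss. The helper names below are hypothetical, this is not necessarily how the reported figure was measured, and the two curves are synthetic toy values rather than results from the experiments.

```python
def iterations_to_reach(curve, target):
    """Return the first iteration whose loss is at or below target, else None.

    curve is a list of (iteration, loss) pairs sorted by iteration.
    """
    for iteration, loss in curve:
        if loss <= target:
            return iteration
    return None

def convergence_speedup(baseline_curve, model_curve, target):
    """Ratio of iterations needed by the baseline to those needed by the model."""
    baseline_iters = iterations_to_reach(baseline_curve, target)
    model_iters = iterations_to_reach(model_curve, target)
    if baseline_iters is None or model_iters is None:
        return None  # at least one run never reached the target
    return baseline_iters / model_iters

# Synthetic toy curves, for illustration only.
toy_baseline = [(100, 1.0), (200, 0.8), (300, 0.5)]
toy_model = [(100, 0.5), (200, 0.4), (300, 0.3)]

speedup = convergence_speedup(toy_baseline, toy_model, target=0.5)
print(speedup)
```

The result depends on the chosen target, so a speed-up figure is best read together with the loss or metric threshold it refers to.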

Text-to-Motion Comparison

Grouped bar charts of R@1, R@3, and MMDist across recent unified methods, with our model leading.
Evaluated on HumanML3D test split. Metrics: R@1/R@3 (higher is better) and MMDist (lower is better).
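For readers unfamiliar with these metrics, the sketch below computes R@k and MMDist from paired text and motion embeddings. It assumes the embeddings already come from a pretrained feature extractor and ranks against every motion in the list; evaluation protocols typically rank within fixed-size candidate pools instead. The function name and the toy vectors are illustrative only.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def r_precision_and_mmdist(text_emb, motion_emb, ks=(1, 3)):
    """text_emb[i] and motion_emb[i] are assumed to be a matched pair.

    R@k:    fraction of texts whose matched motion is among the k nearest motions.
    MMDist: mean distance between each text and its matched motion.
    """
    n = len(text_emb)
    hits = {k: 0 for k in ks}
    matched_distance = 0.0
    for i, text in enumerate(text_emb):
        dists = [euclidean(text, motion) for motion in motion_emb]
        ranking = sorted(range(n), key=lambda j: dists[j])
        rank = ranking.index(i)  # 0 means the matched motion is the nearest one
        for k in ks:
            if rank < k:
                hits[k] += 1
        matched_distance += dists[i]
    return {k: hits[k] / n for k in ks}, matched_distance / n

# Toy embeddings, for illustration only.
toy_text = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_motion = [[0.1, 0.0], [0.0, 0.9], [0.9, 0.1]]

r_at_k, mmdist = r_precision_and_mmdist(toy_text, toy_motion)
print(r_at_k, round(mmdist, 3))
```

In this toy case only the first text retrieves its own motion first, so R@1 is one third, while R@3 is trivially 1 because there are only three candidates.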

Ablations on Bimodal Architecture

a) Cross-Modal Attention
Curves of R@1/R@3 and MMDist as the number of last layers with cross-modal attention increases.
CMA enabled in the last L layers (L ∈ {1,…,6}). Performance improves up to L=5, then slightly degrades at L=6, indicating a non-monotonic pattern.
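The "last L layers" setting can be expressed as a per-layer on/off mask. The sketch below assumes a hypothetical depth of 12 layers; the actual depth is not stated here, and `cma_layer_mask` is an illustrative helper name.

```python
def cma_layer_mask(num_layers, last_l):
    """Return one boolean per layer; True where cross-modal attention is enabled.

    Only the final last_l layers are enabled, counting from the output side.
    """
    if not 0 <= last_l <= num_layers:
        raise ValueError("last_l must be between 0 and num_layers")
    first_enabled = num_layers - last_l
    return [layer >= first_enabled for layer in range(num_layers)]

NUM_LAYERS = 12  # hypothetical depth, chosen only for illustration

for last_l in range(1, 7):  # the ablation range L in {1, ..., 6}
    mask = cma_layer_mask(NUM_LAYERS, last_l)
    print(last_l, "".join("x" if on else "." for on in mask))
```

Each printed row marks the enabled layers with `x`, which makes it easy to see that increasing L extends cross-modal attention toward earlier layers.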
b) Model Params
Scatter/line charts relating text and motion branch parameters to R@k and MMDist on motion generation.
All models trained for 200K iterations. A 124M text branch is competitive with 355M/774M, and the motion branch achieves strong results with ~51M params (halved vs. larger variants).
